Improved inter-subject alignment of the lumbosacral cord for group-level in vivo gray and white matter assessments: A scan-rescan MRI study at 3T

Introduction Magnetic resonance imaging (MRI) enables the investigation of pathological changes in gray and white matter at the lumbosacral enlargement (LSE) and conus medullaris (CM). However, conducting group-level analyses of MRI metrics in the lumbosacral spinal cord is challenging due to variability in CM length, lack of established image-based landmarks, and unknown scan-rescan reliability. This study aimed to improve inter-subject alignment of the lumbosacral cord to facilitate group-level analyses of MRI metrics. Additionally, we evaluated the scan-rescan reliability of MRI-based cross-sectional area (CSA) measurements and diffusion tensor imaging (DTI) metrics. Methods Fifteen participants (10 healthy volunteers and 5 patients with spinal cord injury) underwent axial T2*-weighted and diffusion MRI at 3T. We assessed the reliability of spinal cord and gray matter-based landmarks for inter-subject alignment of the lumbosacral cord, the inter-subject variability of MRI metrics before and after adjusting for the CM length, the intra- and inter-rater reliability of CSA measurements, and the scan-rescan reliability of CSA measurements and DTI metrics. Results The slice with the largest gray matter CSA as an LSE landmark exhibited the highest reliability, both within and across raters. Adjusting for the CM length greatly reduced the inter-subject variability of MRI metrics. The intra-rater, inter-rater, and scan-rescan reliability of MRI metrics were the highest at and around the LSE (scan-rescan coefficient of variation <3% for CSA measurements and <7% for DTI metrics within the white matter) and decreased considerably caudal to it. Conclusions To facilitate group-level analyses, we recommend using the slice with the largest gray matter CSA as a reliable LSE landmark, along with an adjustment for the CM length. We also stress the significance of the anatomical location within the lumbosacral cord in relation to the reliability of MRI metrics. The scan-rescan reliability values serve as valuable guides for power and sample size calculations in future longitudinal studies.


Introduction
The lumbosacral spinal cord (SC) contains nuclei that innervate the lower limbs and pelvic organs.Pathological changes in the gray matter (GM) or white matter (WM) of the lumbosacral cord can lead to various dysfunctions such as motor and sensory impairments in the lower limbs, as well as lower urinary tract, sexual and bowel dysfunction [1][2][3].
In the lumbosacral cord, SC level-specific and group-level analyses are challenged by the mismatch between vertebral and neurological levels [21], which precludes the use of vertebral levels as neuroanatomical landmarks.Given the difficulty of identifying neurological levels in vivo in the lumbosacral cord [22], previous studies have suggested the use of the slice with the largest SC CSA, here referred to as the "LSE landmark", as an image-based neuroanatomical landmark [18,23].However, the intra-and inter-rater reliability of image-based landmarks have not been reported.Additionally, slice-wise group comparisons of MRI metrics are increasingly challenging toward the tip of the spinal cord due to the high inter-subject variability in the CM length [18].
Automatic SC and GM segmentation techniques have been shown to perform well for the cervical cord [24] but have not yet been optimized for the lumbosacral cord.This is primarily attributed to the smaller size and lower signal-to-noise ratio, as well as the close proximity of nerve roots, which can compromise SC segmentation.Consequently, manual segmentation is still the standard segmentation technique for the lumbosacral cord.For healthy volunteers, the feasibility of manual GM and WM segmentation has been demonstrated within the LSE [23] and CM [18], while the feasibility of diffusion tensor imaging (DTI) has been shown within the LSE [25].However, none of these studies reported scanrescan values for the CM.
The aim of this study is twofold.First, to facilitate group-level analyses, we aimed to improve the inter-subject alignment of the lumbosacral cord by comparing GM-and SCbased landmark definition methods and proposing a method to adjust for the individual CM length.Second, we computed intra-rater, inter-rater, and scan-rescan reliability of the SC, GM, and WM CSA, as well as scan-rescan reliability of the DTI metrics, within both LSE and CM, and both in healthy volunteers and patients.Here, our focus was on measuring the reliability of MRI metrics in a patient cohort where the lumbosacral cord is not directly affected.For this reason, we included patients with a spinal cord injury (SCI) in the cervical or thoracic cord, who represent a challenging imaging cohort as they often experience higher levels of involuntary motion (e.g., due to spasticity).

Study participants
Fifteen participants including 10 healthy volunteers (4 females, 6 males, age (mean ± standard deviation (SD)): 32.9 ± 10.8 years) and 5 SCI patients with cervical or thoracic injuries, but without injury-related radiological structural abnormalities in the lumbosacral cord (1 female, 4 males, age: 46.1 ± 20.3 years, time since injury: 5.9 ± 0.1 months) participated in this study.SCI patients had neurological impairments with mixed aetiologies, injury levels, and severities (see S1 Table for demographic and clinical information).The patient cohort was part of a randomized controlled trial investigating the effect of transcutaneous tibial nerve stimulation on the emergence of neurogenic lower urinary tract dysfunction following SCI (TASCI, ClinicalTrials.govidentifier: NCT03965299) [26].The main inclusion criteria for all participants were: (i) no MRI contraindications, (ii) no preexisting neurological or mental disorders, and (iii) > 18 years of age.For a comprehensive list of inclusion and exclusion criteria specific to SCI patients, refer to [26].Healthy volunteers were scanned twice, with an interval of (mean ± SD) 8.4 ± 3.0 weeks (range: 6-15 weeks) between scans.Participants were recruited between 17.04.2020and 14.02.2022.The study was approved by the local ethics committee (Kantonale Ethikkommission Zu ¨rich, BASEC ID: 2019-00074) and conducted in accordance with the Declaration of Helsinki.Written informed consent was obtained from all participants.

MRI acquisition
Scanning was performed on a 3T Siemens Prisma MRI scanner (Siemens Healthineers, Erlangen, Germany) using a body transmit coil and a 32-channel spine matrix coil.Foam wedges were placed beneath the knees to minimize the lower spine natural lordotic curve and maximize contact between the lower back and the spine matrix coil.Motion artifacts in the lower back area were reduced by placing the legs onto a spine vacuum cushion and applying Velcro straps around the knees, hips, and chest.
A sagittal T2-weighted turbo spin echo sequence was acquired as an anatomical reference of the lumbosacral cord as described in [17].Subsequently, axial T2*-weighted images were acquired using a 3D spoiled multi-echo gradient-echo sequence (ME-GRE, Siemens FLASH) with 20 axial-oblique slices of 5 mm thickness (no gap) (Fig 1).A saturation band was placed anterior to the spine to suppress potential artifacts from the abdomen.The sequence parameters were as described in [17].
Images for diffusion MRI were acquired using a reduced field of view (FOV) single-shot spin-echo echo planar imaging (EPI) sequence with 15 slices of 5 mm thickness (no gap) (Fig 1).To prevent fold-over artifacts, two saturation bands were respectively placed anterior and posterior to the FOV.The sequence consisted of 180 diffusion-weighted (b = 800 s/mm 2 ) and 6 T2-weighted (b = 0 s/mm 2 ) images with an in-plane resolution of 0.9x0.9mm 2 , in-plane FOV of 86x32 mm 2 , repetition time of 440 ms, echo time of 56 ms, no parallel imaging, 7/8 phase partial Fourier in the anterior-posterior phase-encoding direction, and bandwidth of 1270 Hz/pixel.The acquisition was cardiac gated, acquiring 3 slices per cardiac cycle with a trigger delay of 120 ms.The total acquisition time depended on the heart rate and was approximately 12 min.

Processing of ME-GRE images
For each repetition, the first three echoes of the ME-GRE scan were combined via root-meansquares, as it has been demonstrated to be the optimal combination for segmenting the SC and GM within the same image [17].The resulting combined echo was then averaged across all repetitions.The SC and GM were manually segmented according to a standard operating procedure (SOP), made available on GitHub (https://github.com/NeuroimagingBalgrist/LumbosacralCordMRI), using the sub-voxel segmentation tool in JIM 7.0 (Xinapse systems), providing corresponding CSA values.WM CSA was obtained by subtracting GM CSA from SC CSA.Sub-voxel segmentations were binarized at an inclusion threshold of 100% for SC and 50% for GM to create binary SC and GM masks.A binary WM mask was created by subtracting the binary GM mask from the binary SC mask.
The three raters were instructed to segment the SC and GM by drawing an isointense contour within the partial volumes along the edges of the SC and GM, respectively, while also considering the anatomical shape and smoothness of the SC and GM.The raters had extensive experience in manually segmenting the lumbosacral cord, having individually segmented over 100 images prior to this study.To establish consensus guidelines for segmentation, the raters initially segmented a separate training set comprising three healthy volunteers, which was not included in the main analysis.These initial segmentations were then compared, and any disagreements were resolved through discussion among the raters.The resulting consensus guidelines were added to the SOP.

Processing of diffusion MRI images
The diffusion MRI images were processed using the ACID toolbox [27].All images were cropped to an in-plane FOV of 32x32 mm 2 .Eddy current and motion correction was performed using ECMOCO [28] with a 3-degrees-of-freedom (DOF) volume-wise registration (translation along x and y; scaling along y (with x and y being the left-right and anterior-posterior direction, respectively)), followed by a 2-DOF slice-wise registration (translation and scaling along y).Images underwent adaptive smoothing using msPOAS [29], with parameters k* = 5 and lambda = 10, to increase the signal-to-noise ratio without introducing blurring across tissue edges.The diffusion tensor model was then fitted on the corrected images using a robust tensor fitting algorithm [30,31] to generate maps of fractional anisotropy (FA), mean diffusivity (MD), axial diffusivity (AD), and radial diffusivity (RD).
The mean corrected diffusion-weighted image was manually segmented for SC in FSLeyes and was spatially normalized to the PAM50 template [32] resulting in both forward (native to template space) and backward (template to native space) warping fields.Normalization was aided by labeling both the slice with the largest SC CSA and the most caudal slice of the SC (determined from the ME-GRE sequence), which were aligned with the corresponding labels in the PAM50 template (disc labels 19 and 21).The probabilistic WM atlas, integrated into the PAM50 template, was warped into the native space using the obtained backward warping field.Using sct_extract_metric from the Spinal Cord Toolbox (v.5.8) [33], weighted average values of DTI metrics were extracted slice-wise within the GM, WM, as well as the dorsal, lateral, and ventral WM columns.As the PAM50 atlas does not encompass the caudal half of the CM, DTI metrics were extracted only from the upper half of the CM.

Comparison of neuroanatomical landmark definition methods
The LSE landmark as an image-based neuroanatomical landmark was defined either as the slice with the largest SC CSA or the largest GM CSA.The curves of slice-wise CSA values (see Fig 3 for examples) were either unsmoothed or underwent smoothing by moving window averaging across three adjacent slices (corresponding to 15 mm).The choice of three slices was driven by the intrinsic smoothness of the curves of slice-wise CSA values and the need for selecting an uneven number of slices (thus avoiding the need for interpolation).Overall, this resulted in a total of four LSE landmark definition methods: i. SC max : slice with the largest SC CSA without moving window averaging ii.SC max,mw : slice with the largest SC CSA with moving window averaging iii.GM max : slice with the largest GM CSA without moving window averaging iv.GM max,mw : slice with largest GM CSA with moving window averaging To assess the reliability of the LSE landmarks, the mean absolute deviation (MAD) of the determined slices was calculated across three sets of segmentations performed by the same rater (intra-rater reliability) and across segmentations performed by three different raters on the same set (inter-rater reliability) (see Section 2.7 for segmentation procedures).
Identifying the tip of the spinal cord (CMtip landmark) was difficult due to the large slice thickness (5 mm).Therefore, it was determined by extrapolating the curve of slice-wise SC CSA values to zero.Note that the most caudal slice where SC was segmented was usually 1-2 slices above the CMtip landmark.

Adjusting for the conus medullaris length
To adjust for the individual CM length, we divided the CM, i.e., space between the two landmarks (LSE and CMtip), into five segments of equal thickness in each subject.For example, if the CM consisted of 8 slices (40 mm) in length, each of the five segments comprised 8/5 = 1.6 slices (equivalent to 8 mm).Without reslicing the images, we extracted CSA values and DTI metrics within each segment, computed as a weighted average of the slice-wise values, where the weights represent the spatial contribution of each slice to the particular segment.This allowed us to determine MRI metrics within segments centered at the LSE and the CMtip landmarks, respectively, four segments located between them, and three segments rostral to the LSE landmark (Fig 2).

Intra-and inter-rater reliability of cross-sectional area measurements
A total of 10 participants including 5 healthy volunteers (1 female, 4 males, age (mean ± SD): 41.4 ± 8.6 years) and 5 patients with SCI (1 female, 4 males, age: 46.1 ± 20.3 years) were included in this analysis.No images had to be discarded due to motion or other artifacts.Segmentation of the axial ME-GRE images was performed by three raters (S.B., G.D., M.D.L.).Two raters performed the segmentation three times and one rater performed it once, with a minimum of two weeks between each round of segmentation.The order of segmentation in each round was pseudo-randomized into three blocks using a computer-generated randomization list to ensure a balanced distribution of patients and healthy volunteers.Raters were provided with the pseudonymized set of images (ME-GRE of the lumbosacral cord).The raters were not provided with images covering the injury site, where the participant status would have been visible.Since the patients did not exhibit any injury-related radiological abnormalities or implant-related artifacts in the lumbosacral cord, raters were unable to discern whether the participant was a healthy volunteer or a patient.The LSE landmark, defined as the GM max, mw slice, was determined in each subject as the median across three raters.
Slice-wise coefficient of variation (CV) of SC, GM, and WM CSA was calculated as the percent ratio between the standard deviation and mean of CSA values either (i) across three sets of segmentations performed by the same rater (intra-rater reliability) or (ii) across segmentations performed by three different raters on the same set (inter-rater reliability).The CV was then averaged across all subjects (n = 10) and separately for patients (n = 5) and healthy volunteers (n = 5).
Intra-and inter-rater intraclass correlation coefficients (ICC) were simultaneously computed using Eliasziw's two-way mixed effects, absolute agreement, single-measure model [34], as implemented in the relInterIntra function of the irr package in R. Two-sided (upper and lower bounds) 95% confidence intervals were additionally calculated according to the formulas in [35].The Dice coefficient represents the spatial overlap between masks and was computed as the size of the union of two (binary) segmentation masks, created either by the same rater The acquired slices are illustrated by rectangular gray boxes (slice thickness of 5 mm).First, two image-based neuroanatomical landmarks (LSE, defined here as the slice with the largest crosssectional GM area, and the CMtip) are determined in each subject (indicated by asterisks).Then, the space between these landmarks are divided into 5 segments of equal thickness (resulting in segment thicknesses of 12, 10, 8, and 6 mm).The segments, displayed as colored spinal cord regions, are defined such that one segment (light blue) is centered at the LSE landmark (segment LSE), another one at the tip of the spinal cord (segment LSE-5), and the space between them is covered by four segments.The spinal cord rostral to the LSE landmark is also divided into segments using the same segment thickness.Average values are extracted within each segment as a weighted average of the slice-wise values, where the weights represent the spatial contribution of each slice to the segment.If a value for a slice which contributes more than 25% to the segment was not available, the value for that segment was not calculated.https://doi.org/10.1371/journal.pone.0301449.g002 (intra-rater Dice coefficient) or different raters (inter-rater Dice coefficient), divided by the average size of the two segmentations masks.Values range from 0 (no overlap) to 1 (perfect overlap).Single Dice coefficients were calculated by averaging pairwise Dice coefficients for each subject, and then further averaged across all subjects.

Scan-rescan reliability of cross-sectional area measurements and diffusion tensor imaging
The reliability analysis across both imaging sessions (scan and rescan) included the 10 healthy volunteers.Patients were not included, as their scan-rescan data might be affected by diseaserelated longitudinal changes.No images had to be discarded due to motion or other artifacts.Segmentation was performed by a single rater (S.B.).The LSE landmark was defined as the GM max,mw slice.
We calculated the mean of the scan-rescan differences ( � d), along with the 95% limits of agreement ( � d � 1:96 � SD).Scan-rescan CV was calculated as the percent ratio between the standard deviation and mean of MRI metrics across scan and rescan.Scan-rescan ICC was computed using a two-way mixed effects, absolute agreement, single-measure model using the icc function of the irr package in R. The minimal detectable change (MDC), or smallest real difference represents the smallest longitudinal change that be confidently attributed to a meaningful change rather than mere scan-rescan variability.The MDC with 95% confidence threshold were calculated as MDC ¼ 1:96 � ffi ffi ffi 2 p � SD total � ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi , where SD total denotes the total standard deviation (taken across all measurements) and ICC denotes the scan-rescan ICC.While MDC describes the minimal detectable longitudinal change in an individual, it is important to note that group studies can detect even smaller longitudinal changes by utilizing multiple subjects.

Comparison of neuroanatomical landmark definition methods
Visual comparison of different LSE landmarks is provided in Fig 3 .The inter-rater MAD values of the LSE landmarks were 0.58 for SC max , 0.42 for SC max,mw , 0.29 for GM max , and 0.22 for GM max,mw .Intra-rater MAD values were consistently the same or lower than corresponding inter-rater values (0.16 for SC max , 0.19 for SC max,mw , 0.29 for GM max , and 0.11 for GM max,mw ).The lowest intra-and inter-rater MAD values were achieved for GM max,mw .The GM max,mw slice was located (mean ± SD) 8.5 ± 4.5 mm more caudally than the SC max,mw slice.The inter-subject variability of SC and GM CSA was lower at caudal locations when using the GM max,mw slice as LSE landmark; however, no clear trend was observed at rostral locations (S2 Table , Fig

Adjusting for the conus medullaris length
After adjusting for the CM length, the inter-subject CV of CSA measurements remained similar at and around the GM max,mw slice, but decreased substantially caudal to it (Table 1).For example, the inter-subject CV of SC CSA was 31.5% five slices (25 mm) below the GM max,mw slice (before adjustment), and it reduced to 17.4% after adjustment at roughly the same anatomical location (segment LSE-3) (compare values highlighted in bold in Table 1).  2 lists the intra-and inter-rater reliability metrics for the CSA measurements.Intra-rater reliability was higher than inter-rater reliability for all CSA values and slices, indicated by lower CV and higher ICC and Dice coefficients.Intra-and inter-rater reliability was in general higher for SC CSA than for GM and WM CSA.Intra-and inter-rater CV increased considerably from the GM max,mw slice toward the tip of the spinal cord, accompanied by a decrease in Dice coefficients.Notably, such a trend could not be observed in the ICC.Intra-rater CV values were similar between healthy volunteers and patients, but inter-rater CV values were higher in the patient group, especially at caudal locations (S3 Table ).

Scan-rescan reliability of cross-sectional area measurements and diffusion tensor imaging
There was no systematic bias, as measured by � d, between the scan and rescan values for any of the MRI metrics (Tables 3-5, S1 Fig) .In general, scan-rescan reliability was higher for SC CSA than for GM and WM CSA, indicated by lower CV and MDC, and higher ICC values The benefit of using the slice with the largest GM CSA (GM max,mw ) as LSE landmark, as opposed to the slice with the largest SC CSA (SC max,mw ), is evident in Subject 1, with no inter-rater variability for the GM max,mw slice.Subject 2 demonstrates the advantage of using moving window averaging: There is no interrater variability in GM max,mw and SC max,mw slice after applying moving window averaging.https://doi.org/10.1371/journal.pone.0301449.g003(Table 3).Among the DTI metrics, RD tended to have the lowest scan-rescan reliability.In general, reliability was higher for DTI metrics when extracted within the entire WM rather than separate WM columns.Reliability (with the exception of ICC values for DTI metrics) decreased considerably from the LSE segment toward the tip of the SC.Without adjusting for the CM length, the scan-rescan CV values for DTI metrics were higher (compare Tables 3-5 with S4-S6 Tables, respectively).The scan-rescan reliability decreased only minimally (i.e., the increase in CV was, on average, below 1 percentage point for MRI metrics extracted within the WM) when the landmarks were determined independently for scan and rescan, as opposed to when they were determined in the first scan (compare Tables 3-5 with S7-S9 Tables, respectively).

Discussion
We investigated methods for improving the inter-subject alignment of axial slice stacks within the lumbosacral cord.We found that the slice with the largest gray matter (GM) cross-sectional area (CSA) can serve as a reliable image-based neuroanatomical landmark for the lumbosacral enlargement (LSE).Adjusting for the conus medullaris (CM) length by dividing the CM into a fixed number of segments and extracting MRI metrics from those segments substantially reduced inter-subject variability.Additionally, we reported different aspects of reliability for CSA measurements and diffusion tensor imaging (DTI) metrics within the lumbosacral cord at 3T.We found that intra-rater, inter-rater, and scan-rescan reliability was the highest at and around the LSE landmark, and decreased caudal to it.

Slice with the largest gray matter cross-sectional area is a reliable image-based neuroanatomical landmark
We found that the slice with the largest GM CSA, as opposed to the slice with the largest spinal cord (SC) CSA, can be identified more consistently across raters and across repeated assessments of the same rater.This is likely because of the sharper peak in the curves of slice-wise GM CSA values at the LSE compared to the flatter peak in SC CSA (Fig 3).Sharper peaks are easier for raters to identify than flatter peaks.Further improvement in the reliability of imagebased landmarks can be achieved by smoothing the curves of slice-wise CSA values through moving window averaging across three adjacent slices, which reduces fluctuations inherent in the slice-wise CSA values.
Furthermore, when aligning the individual axial slice stacks at the LSE landmark determined based on GM CSA, as opposed to SC CSA, we noticed a slight reduction in inter-subject variability of CSA values at caudal locations.The reason for this reduction remains unclear.We argue that the slice with the largest GM CSA is more closely associated with the neurological level than the slice with the largest SC CSA, considering that the LSE is neuroanatomically attributed to the enlargement of the GM.
Notably, a previous study based on post-mortem MRI utilized distinct morphological features of the ventral GM horns to characterize the lumbosacral cord across species [36].Another study revealed a strong link between these morphological features and the motor neuron pools located within the ventral GM horns [37].We anticipate that, for in vivo studies, the shape of GM would serve as an even better neuroanatomical landmark than its size.In a recent Notes: CSA values represent mean ± standard deviation across 10 healthy volunteers.Individual axial slice stacks were aligned at the LSE landmark, defined as the slice with the largest gray matter CSA (GM max,mw ).Values are displayed for both original slices (not adjusted for the length of the conus medullaris) and segments (adjusted for the length of the conus medullaris).While a direct correspondence between the original slices and segments is not possible due to their different thickness, the rows highlighted in bold roughly represent the same anatomical location.A positive distance indicates a rostral direction from the LSE landmark.Values were derived from the first scan; highly comparable results were obtained from the second scan (rescan).
Abbreviations: CV, inter-subject coefficient of variation; CSA, cross-sectional area; LSE, lumbosacral enlargement. https://doi.org/10.1371/journal.pone.0301449.t001 study, a distinct approach was used to directly identify neurological levels through nerve root tracing [38].However, this approach requires the acquisition of an additional image optimized for nerve segmentation, thereby increasing the scan time, and might not work reliably in all subjects.Future research should focus on identifying neurological levels directly from the axial ME-GRE images.

Improved inter-subject alignment after adjusting for the conus medullaris length
To adjust for the considerable inter-subject variation in CM length, we divided the CM into a fixed number of segments and extracted MRI metrics from these segments using a weighted average of the slice-wise values.This adjustment substantially reduced the inter-subject variability of MRI metrics at caudal locations.This is beneficial for group-level analyses, as  2. SC and GM segmentations are shown on alternating slices for display purposes.For the displayed SC and GM segmentations, the coefficient of variation (CV) values of the corresponding SC cross-sectional areas (CSA) were (intra-vs.inter-rater) 0.8% vs. 3.7% (slice 17), 2.4% vs. 3.2%, 1.2% vs. 9.9%, 4.8% vs. 12.3%, and 6.8% vs. 6.7% (slice 9).The CV values of the corresponding GM CSA were (intra-vs.inter-rater) 1.8% vs. 9.1% (slice 16), 1.5% vs. 9.1%, 2.4% vs. 8.9%, 6.5% vs. 10.9%, and 3.6% vs. 9.5% (slice 8). https://doi.org/10.1371/journal.pone.0301449.g005 reduced inter-subject variability increases statistical power for the same sample size or necessitates a smaller sample size for the same statistical power.We note that another line of research is concerned with reducing inter-subject variability of MRI metrics that arise from biological variability, using regression models incorporating demographic information, spine, and SC metrics [18,39]; however, this was not the focus of our investigation.The decision of dividing the CM into five segments was determined by both the slice thickness (5 mm) and the range of CM lengths observed within the healthy population [18].By using five segments, we ensured that segments were thicker than the slice thickness in all participants, avoiding the need for interpolating within slices.While thinner slices would allow for a higher number of segments and increased spatial specificity, this comes at the cost of reduced scan-rescan reliability.Conversely, dividing the CM into fewer segments would likely enhance scan-rescan reliability, but it would come at the expense of specificity along the rostro-caudal axis.
Spinal cord segments, created by correcting for the individual CM length, do not correspond to specific neurological levels.Instead, each segment consists of a combination of neurological levels, where more caudal segments encompass more neurological levels due to their progressively shorter length [40].Nevertheless, this approach is still consistent with * Indicates significant difference between scan and rescan (p < 0.05).

Notes:
The individual axial slice stacks were aligned at the LSE landmark, defined as the slice with the largest gray matter CSA (GM max,mw ), and were adjusted for the length of the conus medullaris.The landmarks were determined in the first scan.A positive segment indicates a rostral direction from the LSE landmark.
Abbreviations: CI, confidence interval; CV, scan-rescan coefficient of variation; CSA, cross-sectional area; � d � , mean scan-rescan difference; ICC, scan-rescan intraclass correlation coefficient; LSE, lumbosacral enlargement; MDC, minimal detectable change; SD, standard deviation.https://doi.org/10.1371/journal.pone.0301449.t003neurological levels as long as their distribution within each segment remains constant across subjects.In addition, while the adjustment method was demonstrated on ME-GRE images, it can be applied to other MRI sequences as well.* Indicates significant difference between scan and rescan (p < 0.05).

Notes:
The individual axial slice stacks were aligned at the LSE landmark, defined as the slice with the largest gray matter CSA (GM max,mw ), and were adjusted for the length of the conus medullaris.The landmarks were determined in the first scan.A positive segment indicates a rostral direction from the LSE landmark.

Notes:
The individual axial slice stacks were aligned at the LSE landmark, defined as the slice with the largest gray matter CSA (GM max,mw ), and were adjusted for the length of the conus medullaris.The landmarks were determined in the first scan.A positive segment indicates a rostral direction from the LSE landmark.

Excellent intra-rater reliability of cross-sectional area measurements at the lumbosacral enlargement
We demonstrated excellent intra-rater reliability for CSA measurements, when using the previously optimized ME-GRE imaging protocol [17], with intra-rater CV values below 4%, 5%, and 6% for SC, GM, and WM CSA, respectively, at and within a 15 mm radius of the LSE landmark.Notably, the inter-rater CV values for CSA measurements were approximately twice as high as the corresponding intra-rater CV values.Consequently, we strongly advocate for the segmentation of all images by the same rater.
We also highlight the importance of anatomical location for the reliability of CSA measurements.More caudal slices within the CM had lower intra-and inter-rater reliability, in line with a previous report [18].This observation is likely attributed to the conus medullaris becoming thinner; for example, a minor difference between two segmentations has a larger impact when the area being segmented is smaller.Despite the increasing CV values toward the tip of the spinal cord, the intra-and inter-rater ICC values exhibited relatively consistent trends along the CM.This is because higher CV values are offset by higher inter-subject variability at more caudal locations (Table 1).
While intra-rater CV values for SC, GM, and WM CSA measurements were, on average, less than 1 percentage point higher in patients with spinal cord injury compared to healthy controls, the inter-rater CV values were, on average, between 2 and 5 percentage points higher, with the largest differences occurring caudal to the LSE landmark.This finding suggests that, for patients, the lower quality of the ME-GRE images increases the inter-rater, but not the intra-rater variability of SC and GM segmentations.The lower quality for patients may be attributed to a higher level of involuntary motion (4 of 5 patients reported spasticity in the lower limbs).While 4 of 5 patients had spinal instrumentation, they were not in close proximity to our field of view (FOV) and are therefore unlikely to affect the image quality.
The higher values for SC CSA, in comparison to GM and WM CSA, reflect the experience of the raters that SC segmentation is an "easier" task than GM segmentation.This is due to the fact that the contrast between the GM and WM is typically lower than between the WM and cerebrospinal fluid on ME-GRE images, and GM exhibits more ambiguity owing to its irregular shape [17].The reliability of CSA measurements depends on the confidence with which manual segmentation is performed, referred to as "segmentability".SC and GM segmentability is influenced by various factors, including the rater's experience, the subject being investigated, the imaging hardware, pulse sequence, and sequence parameters [17].Consequently, the comparability of reliability values across studies is limited by variations in these factors.In comparison to values obtained on a 3T Philips MRI scanner [18], our study showed lower intra-rater (1.6% vs. 3.5%) but higher inter-rater CV for SC CSA (4.7% vs. 3.0%) at the LSE landmark within healthy volunteers.The CV values for GM CSA were found to be lower in our study for both intra-rater (4.0% vs. 7.7%) and inter-rater (6.1% vs. 9.9%) analyses.

Scan-rescan reliability depends on the anatomical location within the lumbosacral cord
Scan-rescan reliability was generally lower compared to the corresponding intra-rater reliability.This is expected, as in addition to the variabilities captured by intra-rater reliability, scanrescan reliability also encompasses variabilities arising from subject positioning, the position of the imaged organ, FOV positioning, and potential imaging hardware instabilities (e.g., scanner drift).The time interval between scan and rescan is another important factor; longer intervals are associated with lower reliability.In our study, we deliberately selected a relatively long time interval of 6 to 15 weeks to mimic time frames commonly encountered in longitudinal clinical studies.Scan-rescan reliability was generally higher compared to the corresponding inter-reliability, which further emphasizes the importance of employing the same rater for segmentation.
The scan-rescan reliability of DTI metrics was found to be higher (i) in slices with larger CSA, particularly at and around the LSE landmark, as opposed to more caudal slices, and (ii) within larger regions (such as WM), compared to smaller ones (such as dorsal and ventral WM).This is because averaging DTI metrics within a larger region reduces the impact of individual outliers, small anatomical variations, and partial volume effects.Scan-rescan reliability can be further improved by averaging DTI metrics across several adjacent slices or segments, although this comes at the cost of decreased spatial specificity.For example, when averaging across three adjacent slices around the LSE landmark, as in [23,25], we observed a reduction in scan-rescan CV from 3.4% to 2.2% for WM CSA and from 4.2% to 2.7% for FA within the WM.When adjusting for the CM length, a small improvement in the scan-rescan reliability for DTI metrics was observed, which is probably due to the larger thickness of the segments (1-2 slices) compared to a single slice.
The DTI metrics exhibited lower scan-rescan reliability in comparison to CSA measurements.This is because diffusion MRI is inherently noisier than structural MRI, and, unlike CSA measurements, the average DTI metrics within a ROI are not solely determined by the size of the ROI, but also by its location.For example, if two SC segmentations do not fully overlap but have the same area, they would yield the same SC CSA, yet different average DTI metrics within the SC.Nevertheless, we demonstrated a very high overlap between segmentation masks created by the same rater, as measured by the Dice coefficient, with values above 91% for SC, 90% for GM, and 76% for WM segmentation.
Despite differences in the cohort, MRI scanner, sequence, and processing pipeline, our scan-rescan CV for SC CSA at the LSE landmark is similar to that reported previously in healthy volunteers scanned on a 3T Philips MRI scanner (vs.Yiannakas et al., 2014 [23]; 1.8% vs. 2.0%).However, our CV for GM CSA was notably lower (2.9% vs. 7.8%).For DTI metrics extracted within the WM at the LSE landmark, our CV values were in line with previously reported values for FA (vs.Yiannakas et al., 2016 [25]: 2.8% vs. 6.0%),MD (5.5% vs. 5.0%), AD (4.7% vs. 5.4%), and RD (6.4% vs. 8.3%).The provided scan-rescan reliability values (Tables 3-5) serve as valuable guides for power and sample size calculations in future longitudinal studies.For example, we can calculate, using the equations in [41], that detecting a 3% change over time in GM CSA at the LSE with a power of 80% and significance of 5% (one-sided) would require 12 subjects.

Limitations
In contrast to CSA values, we did not separately investigate intra-and inter-rater reliability for DTI metrics.This is because while obtaining CSA values involves a single manual step (SC and GM segmentation in the ME-GRE images), the processing pipeline for DTI requires several manual interventions (selection of SC midpoint, SC segmentation in both the EPI and ME-GRE images), making it challenging to isolate individual effects.However, we argue that the reported scan-rescan reliability values encompass the combined effect of all these sources of variability and are therefore most relevant for planning future studies.Additionally, DTI metrics could not be extracted from the caudal half of the CM, as this region is currently not covered by the PAM50 atlas.This is a recognized issue within the community and is likely to be addressed in future developments.A limitation of the study is the relatively small sample size; however, this sample size is typical in method development and scan-rescan studies.Moreover, the findings may not be generalizable to patient populations with pathologies (e.g.lesions) within the lumbosacral cord.Another limitation is the considerable length of the MRI examination; however, the acquisition time of the ME-GRE images can be reduced for clinical applications [17].

Conclusions
We provide recommendations for improved inter-subject alignment within the lumbosacral cord to facilitate group-level analyses of MRI metrics.Specifically, we propose using the slice with the largest gray matter cross-sectional area as a reliable image-based neuroanatomical landmark, along with an adjustment method for the length of the conus medullaris.We emphasize the importance of anatomical location for intra-rater, inter-rater, and scan-rescan reliability, which were the highest at the lumbosacral enlargement and decreased toward the tip of the spinal cord.The provided scan-rescan reliability values serve as valuable guides for power and sample size calculations in future longitudinal studies.

Fig 1 .
Fig 1. (A) The axial slice stacks of the 3D spoiled multi-echo gradient-echo (ME-GRE) sequence (indicated in green) and the diffusion MRI sequence (indicated in yellow), overlaid on the sagittal T2-weighted image.The field of view of the ME-GRE scan was set such that its 6 th most rostral slice (slice #15) aligned with the maximum width of the spinal cord as observed in the sagittal T2-weighted scan (slice highlighted in red across all images).The slice stack of the ME-GRE scan covers the lumbosacral enlargement (LSE) and the entire conus medullaris (CM), while that of the diffusion MRI is smaller and centered at the 9 th most rostral slice (slice #12) of the ME-GRE scan.(B) Corresponding axial slices of the ME-GRE scan.(C) Corresponding axial slices of the maps of diffusion tensor imaging metrics, including fractional anisotropy (FA), mean diffusivity (MD), axial diffusivity (AD), and radial diffusivity (RD).Axial slices are displayed in rostral (top left) to caudal (bottom right) direction.https://doi.org/10.1371/journal.pone.0301449.g001

Fig 2 .
Fig 2. Adjustment for the individual conus medullaris (CM) length (60, 50, 40, and 30 mm for the displayed cases).The acquired slices are illustrated by rectangular gray boxes (slice thickness of 5 mm).First, two image-based neuroanatomical landmarks (LSE, defined here as the slice with the largest crosssectional GM area, and the CMtip) are determined in each subject (indicated by asterisks).Then, the space between these landmarks are divided into 5 segments of equal thickness (resulting in segment thicknesses of 12, 10, 8, and 6 mm).The segments, displayed as colored spinal cord regions, are defined such that one segment (light blue) is centered at the LSE landmark (segment LSE), another one at the tip of the spinal cord (segment LSE-5), and the space between them is covered by four segments.The spinal cord rostral to the LSE landmark is also divided into segments using the same segment thickness.Average values are extracted within each segment as a weighted average of the slice-wise values, where the weights represent the spatial contribution of each slice to the segment.If a value for a slice which contributes more than 25% to the segment was not available, the value for that segment was not calculated. 4).

Fig 5
Fig 5  illustrates the intra-and inter-rater variability in SC and GM segmentations, while Table2lists the intra-and inter-rater reliability metrics for the CSA measurements.Intra-rater reliability was higher than inter-rater reliability for all CSA values and slices, indicated by lower CV and higher ICC and Dice coefficients.Intra-and inter-rater reliability was in general

Fig 3 .
Fig 3. Curves of slice-wise cross-sectional area (CSA) of the spinal cord (SC) and gray matter (GM), obtained by three different raters in two subjects, with and without applying moving window averaging across 3 adjacent slices.The LSE landmarks, as determined by the raters, are indicated by asterisks.The benefit of using the slice with the largest GM CSA (GM max,mw ) as LSE landmark, as opposed to the slice with the largest SC CSA (SC max,mw ), is evident in Subject 1, with no inter-rater variability for the GM max,mw slice.Subject 2 demonstrates the advantage of using moving window averaging: There is no interrater variability in GM max,mw and SC max,mw slice after applying moving window averaging.

Fig 4 .
Fig 4. Inter-subject mean (solid line) and standard deviation (shaded area) of the cross-sectional area (CSA) of the spinal cord (SC) and gray matter (GM), computed across 10 healthy volunteers, when aligning the individual slice stacks at the slice either with the largest SC CSA (SC max,mw , in red) or GM SCA (GM max,mw , in blue), without adjusting for the length of the conus medullaris.The inter-subject variability at caudal locations was slightly lower when aligning at the GM max,mw slice, as seen by the smaller width of the blue shaded areas.https://doi.org/10.1371/journal.pone.0301449.g004

Table 2 . Slice-wise intra-and inter-rater reliability of cross-sectional area measurements (n = 10; 5 healthy volunteers and 5 patients with spinal cord injury).
CSA values represent an average across values obtained by three raters, based on the first set of segmentation of each rater.The individual axial slice stacks were aligned at the LSE landmark, defined as the slice with the largest gray matter CSA (GM max,mw ), without adjustment for the length of the conus medullaris.A positive distance indicates a rostral direction from the LSE landmark.For a single subject, GM and WM CSA values were not available for slices with coordinates +20 and -30 mm (n = 9).